1. INTRODUCTION

A word embedding represents each word as a dense vector. It is an improvement over the more
traditional bag-of-words model, which uses large sparse vectors to represent each word in the entire
vocabulary. Because the vocabulary is vast, a given word or document is represented by a sparse vector
consisting mostly of zero values. In an embedding, by contrast, each word is expressed as a dense vector
that projects the word into a continuous vector space.

There has been a surge of work proposing word embeddings that use diverse training schemes based
on neural-network language modeling, such as [1]-[3]. Distributed vector representations of words can
capture the meanings of words, so word embeddings are crucial for improving learning algorithms in natural
language processing tasks such as [4], [5]. Among the various approaches to representing words as
distributed vectors, we propose a new one. The Word2Vec model (Continuous Bag-of-Words (CBOW) and
Skip-gram) in [6] outputs a feature matrix of words. During training, two matrices are created between the
input layer and the output layer. Several previous works have shown that the output vector can also act as
a word embedding and performs almost as well as the input vector. We call the input vector the input
embedding and the output vector the output embedding. In this paper we utilize both the input embedding
and the output embedding from the Word2Vec model, obtaining the input embedding matrix and the output
embedding matrix after training. We propose a better embedding by combining the input embedding and the
output embedding in various ways. It outperforms the embedding that uses only the input embedding, as
Word2Vec does by default. Many works represent words very well with pre-trained vectors, such as [7]-[9],
but the main point of the proposed scheme is that it uses the simplest way to represent words with the
basic Word2Vec model. Compared with the basic model, the performance of this method is notable. It may
not achieve state-of-the-art performance among word embeddings, but we present various ways of utilizing
input and output embeddings.

Input embedding and output embedding can both serve as word embeddings. We use both to derive
richer distributional relationships. It has been shown that combining embeddings yields a better word
embedding than using each individually. Unlike other papers, which combine embeddings from different
models, we use only the embeddings from the Word2Vec model. In this paper, we tried various ways to
combine the input embedding and the output embedding into a word embedding that represents words well.
We compare the quality of each individual embedding, input and output, and of their combinations on the
word analogy task, the word similarity task, and a comparison of nearest neighbors to see which
combination method performs better.

The main contributions of this paper are as follows:

- We propose various efficient ways to represent words better using the input embedding and output
embedding from the Word2Vec model.

- We compare the performance of the input embedding and output embedding with each of the dual
embeddings using various evaluation methods: the word analogy task, the word similarity task, and nearest
neighbors.

- Our dual embedding is the simplest way of representing words compared with recent works.

Section 2 explains where our idea of dual embedding came from, together with related works.
Section 3 describes our dual embedding models one by one. Section 4 evaluates the embeddings produced by
the dual embedding models and compares their quality with the input embedding and output embedding.
Section 5 concludes the paper.


2. RELATED WORKS 
2.1. Word representations 

The Word2Vec model was first introduced by Sonkar et al. [6] to learn high-quality word
representations from large data with billions of words. The models are effective at capturing the semantics
and syntax of words, as measured in a word analogy task, which is useful for various natural language
processing tasks. There have been attempts to improve Word2Vec with various training methodologies, such
as casting the training scheme of Skip-gram with negative sampling (SGNS) as weighted matrix factorization
[10]. Meanwhile, other works explained the Word2Vec model's negative sampling [11] and parameter learning
[12] in detail. Hambi and Benabbou [12] described the "input vector" and "output vector" that arise in the
Word2Vec model during training.


2.2. Awareness of the output embedding 

There were some attempts to use both the input vector and the output vector in [13], [14] to
determine the usefulness of the output vector. Li and Summers-Stay [13] observed that the output vector in
a Word2Vec model can also be useful. They retrained both embedding spaces to obtain more distributional
relationships. They stated that the Word2Vec model contains two separate embedding spaces (input and
output) whose interactions capture additional meanings of words that cannot be found in either embedding
alone [15]. They therefore combined the embeddings to leverage both embedding spaces and used the result
for query and document ranking.

Similarly, Nalisnick et al. [14] tried to improve the model for information retrieval (IR). They
postulated that, for certain IR tasks, both the IN and the OUT embeddings should be used jointly. In [13]
and [14], dual embedding with the input embedding and output embedding means mapping the query words into
the input space and the document words into the output space.

According to Press and Wolf [16], with the Word2Vec Skip-gram model the quality of the output
embedding is almost as good as that of the input embedding when tested on five embedding evaluation
methods. They proposed a model that ties the input and output embeddings, which improves the perplexity of
various language models. In addition to these works that use both embeddings, several others have utilized
the input embedding and output embedding by demonstrating the effectiveness of the output embedding.


2.3. Combining embeddings 


Several methods have been proposed to combine embedding vectors. Garten et al. [17] combined
vectors generated from different models, such as distributed vector representation in sigma (DVRS) [18] and
Word2Vec. Tsuboi [19] showed how to combine Word2Vec and GloVe embeddings in a part-of-speech (POS)
tagging task, demonstrating that using the two embedding sets together improves tagging accuracy over
using either individually. Jiao and Zhang [5] started from the motivation that semi-supervised approaches
can improve accuracy. They combined two public embeddings, the Collobert and Weston (CW) embedding [2] and
the hierarchical log-bilinear (HLBL) embedding [3], and showed better performance than using these
embeddings individually. A multi-view word embedding scheme using a two-sided neural network was proposed
in [20]. The authors created several embeddings by training the CBOW model on various datasets such as a
Wikipedia corpus, search click-through data, and user query data. They combined these embeddings and
showed that using them together gives stronger results than using them individually.

Goikoetxea et al. [21] concatenated word embeddings trained on different corpora and on WordNet
and improved performance. Yin and Schütze [22] proposed various methods of combining five different public
embedding sets: Word2Vec [6], [23]-[28], GloVe [29], CW [2], HLBL [3], and Huang et al. [30]. They
introduced concatenation (CONC), all known words (AVG), singular value decomposition (SVD), and 1toN to
combine these embeddings to better represent words. Similarly, Coates and Bollegala [31] introduced an
autoencoder method to combine public embeddings. These previous works showed that combining embeddings
performs better than using one embedding alone.


3. PROPOSED METHOD 

The Word2Vec model introduced by Sonkar et al. [6] is a neural network-based technique that learns
word embeddings from context words. It rests on the distributional hypothesis: words that appear in similar
contexts hold similar meanings. Word2Vec learns word representations through the Skip-gram model and the
continuous bag-of-words (CBOW) model. The CBOW model is trained by predicting the target word from its
context words; it learns a word's embedding by maximizing the log probability of the word given the context
words in the window. The Skip-gram model works in the opposite direction: it predicts the context words
from the target word. It learns an embedding for each word both in an input embedding matrix and in an
output embedding matrix. There are two weight matrices in the model. The first lies between the input layer
and the hidden layer. In Figure 1, W_in is this input weight matrix of size V×N, where V is the vocabulary
size and N is the hidden layer size. The second weight matrix lies between the hidden layer and the output
layer; this is the output matrix W_out of size N×V in Figure 1. Both matrices are updated as we train on
context words and target words. Normally, the input weight matrix W_in is returned as the Word2Vec word
embedding while the output weight matrix W_out is discarded after training. In this paper, however, we use
both the input weight matrix and the output weight matrix to better represent words. We call the input
weight matrix W_in the input embedding emb_in and the output weight matrix W_out the output embedding
emb_out.

Figure 1. Learning process of the Word2Vec model
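
As a rough illustration of the two weight matrices, the following NumPy sketch runs one Skip-gram forward
pass on a toy model. The sizes, the random initialization, and the full softmax are assumptions made for
readability; they are not the training setup used in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
V, N = 6, 4                                  # toy vocabulary size and hidden size (assumed)
W_in = rng.normal(scale=0.1, size=(V, N))    # input weight matrix, V x N
W_out = rng.normal(scale=0.1, size=(N, V))   # output weight matrix, N x V

target = 2                                   # index of the target word
hidden = W_in[target]                        # hidden layer is the target's row of W_in
scores = hidden @ W_out                      # one score per vocabulary word
probs = np.exp(scores - scores.max())
probs /= probs.sum()                         # softmax over candidate context words

emb_in = W_in                                # input embedding, one row per word
emb_out = W_out.T                            # output embedding, transposed to V x N
```

Transposing W_out gives both embeddings the same V×N shape, which is what makes the row-wise combinations
in the following subsections possible.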


3.1. Concatenation 

We use concatenation and sum as simple ways to combine two embedding vectors into one. We combine
the input embedding and the output embedding to capture the features of both. Concatenating embeddings was
used in [22], where five public embeddings were concatenated and concatenation was found to be an
effective method. We did the same, with the only difference being that we concatenated only two embedding
vectors from one model: the input embedding and the output embedding from Word2Vec, as in (1). We then
applied L2 normalization to emb_conc. Combining the two embeddings into one vector doubles the dimension
relative to each individual embedding, which increases the number of parameters in downstream training. We
refer to this embedding as the CONC embedding.


$$emb_{conc} = emb_{in} \oplus emb_{out} \tag{1}$$
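
A minimal NumPy sketch of (1) followed by row-wise L2 normalization is given below. The helper names and
the random toy matrices are illustrative assumptions standing in for trained embeddings.

```python
import numpy as np

def l2_normalize(emb):
    """Scale every row (word vector) to unit Euclidean length."""
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    return emb / np.maximum(norms, 1e-12)    # guard against all-zero rows

def conc_embedding(emb_in, emb_out):
    """Concatenate the two embeddings per word, then L2-normalize."""
    return l2_normalize(np.concatenate([emb_in, emb_out], axis=1))

rng = np.random.default_rng(0)
emb_in = rng.normal(size=(5, 3))     # toy stand-ins for trained embeddings
emb_out = rng.normal(size=(5, 3))
emb_conc = conc_embedding(emb_in, emb_out)
print(emb_conc.shape)                # (5, 6): twice the original dimension
```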


3.2. Sum 

The sum embedding is the result of adding the input embedding and the output embedding
element-wise, as in (2). Its dimension is the same as that of the input embedding and output embedding.
Bao and Bollegala [32] proposed averaging embeddings to combine them into one vector. They showed that if
the word embeddings are approximately orthogonal, averaging preserves the same information as
concatenation without increasing the dimensionality. In this paper, we tried both averaging the embeddings
as in [32] and simply adding them without dividing by 2. Comparing the results, simply adding the two
embeddings performed well, so we used it as the SUM embedding (emb_sum) in further experiments.


$$emb_{sum} = emb_{in} + emb_{out} \tag{2}$$
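
The two variants tried here, plain addition as in (2) and averaging as in [32], can be sketched as follows.
The function names and toy values are assumptions for illustration only.

```python
import numpy as np

def sum_embedding(emb_in, emb_out):
    """Element-wise sum of the two embeddings, as in equation (2)."""
    return emb_in + emb_out

def avg_embedding(emb_in, emb_out):
    """Element-wise average, i.e. the sum divided by 2."""
    return (emb_in + emb_out) / 2.0

emb_in = np.array([[1.0, 2.0], [3.0, 4.0]])     # toy values, not trained vectors
emb_out = np.array([[0.5, 0.5], [1.0, -1.0]])
emb_sum = sum_embedding(emb_in, emb_out)        # same shape as each input
emb_avg = avg_embedding(emb_in, emb_out)
```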


3.3. Autoencoder

An autoencoder is an unsupervised way of finding data features from the input data alone. This
method was introduced in [31] to combine different word embeddings, e.g. Word2Vec and GloVe. In this
paper, however, we use only the input embedding and output embedding from Word2Vec.

We used the CONC embedding, i.e. the concatenation of the input embedding and output embedding,
as the input to the autoencoder. The goal is to make the reconstructed matrix in the output layer similar
to the original matrix in the input layer by minimizing the total reconstruction error. We randomly
initialized the weight matrices and did not use any activation function. As we train this model with (3),
the matrix in the hidden layer learns from the input layer, which is the concatenation of the input
embedding and output embedding.


$$loss = \sum \left\| (emb_{in} \oplus emb_{out})' - (emb_{in} \oplus emb_{out}) \right\|^2 \tag{3}$$


The autoencoder learns features from both embeddings. The matrix in the hidden layer, called the
compressed matrix, has half the dimension of the input because it extracts data features from the input
layer. We used this compressed matrix as our autoencoder-based dual embedding. It has a smaller dimension
than the CONC embedding that is input to the autoencoder, so we obtain a compressed embedding that still
contains the information of the input embedding and output embedding. We named this word embedding
autoencoder-based CONC (AE-CONC) because it uses the CONC embedding as input.
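
The following is a minimal sketch of such a linear autoencoder, assuming plain full-batch gradient descent
and a loss averaged over words rather than summed. The optimizer, learning rate, epoch count, and toy data
are assumptions, since the paper does not state them.

```python
import numpy as np

def train_linear_autoencoder(x, hidden_dim, epochs=300, lr=0.1, seed=0):
    """Fit x ~ (x @ w_enc) @ w_dec by gradient descent on the squared error."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    w_enc = rng.normal(scale=0.1, size=(d, hidden_dim))   # random initialization
    w_dec = rng.normal(scale=0.1, size=(hidden_dim, d))
    losses = []
    for _ in range(epochs):
        hidden = x @ w_enc                 # compressed matrix (no activation)
        err = hidden @ w_dec - x           # reconstruction minus original
        losses.append(float((err ** 2).sum() / n))
        grad_dec = (2.0 / n) * hidden.T @ err
        grad_enc = (2.0 / n) * x.T @ (err @ w_dec.T)
        w_enc -= lr * grad_enc
        w_dec -= lr * grad_dec
    return x @ w_enc, losses

rng = np.random.default_rng(1)
emb_in = rng.normal(size=(50, 4))          # toy stand-ins for trained embeddings
emb_out = rng.normal(size=(50, 4))
conc = np.concatenate([emb_in, emb_out], axis=1)
conc /= np.linalg.norm(conc, axis=1, keepdims=True)
ae_conc, losses = train_linear_autoencoder(conc, hidden_dim=4)   # 8 -> 4 dimensions
```

Here the hidden size is set to half the CONC dimension, so the compressed matrix has the same dimension as
each original embedding.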

We tried several different inputs to the autoencoder. First, we used the CONC embedding as input,
which yields an embedding with the same dimension as the input embedding and output embedding. Next, we
used the SUM embedding as input and named the result AE-SUM. We also built a SUM embedding with a weight
ratio of 8:2 between the input embedding and output embedding; this ratio was chosen heuristically. We
named the resulting embedding AE-SUMR. The dimension of AE-SUM and AE-SUMR is half that of the input
embedding and output embedding, so these embeddings express words in a smaller dimension.
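
One way to read the 8:2 weighted sum is sketched below. Scaling the weights so that they add up to 1 is an
assumption, as is the helper name; the paper only states the ratio.

```python
import numpy as np

def weighted_sum_embedding(emb_in, emb_out, ratio=(8, 2)):
    """Add the two embeddings with weights proportional to `ratio`."""
    w_in, w_out = ratio
    total = w_in + w_out                   # assumption: weights are scaled to sum to 1
    return (w_in / total) * emb_in + (w_out / total) * emb_out

emb_in = np.array([[1.0, 0.0], [0.0, 2.0]])      # toy values, not trained vectors
emb_out = np.array([[0.0, 1.0], [2.0, 0.0]])
sumr = weighted_sum_embedding(emb_in, emb_out)   # 0.8 * emb_in + 0.2 * emb_out
```

The resulting matrix would then be passed to the autoencoder in place of the CONC embedding.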


3.4. Singular value decomposition 

Singular value decomposition (SVD) factorizes the embedding matrix into a product of three
matrices. SVD has been utilized in diverse natural language processing tasks such as [33], [34] to reduce
the dimensionality of a feature space. Its use for combining vectors was introduced in [22], where the
concatenated (CONC) embedding was used as input to reduce the dimensionality. We instead used the SUM
embedding matrix, which gave better results, and we used the method only to combine the input embedding
and output embedding. Given a matrix C of size n×k, the decomposition C = USV^T yields U, an n×k matrix
with orthonormal columns, S, a k×k diagonal matrix, and V, a k×k unitary matrix. Here, n is the vocabulary
size and k is the embedding dimension. We used the SUM matrix of the input embedding and output embedding
as C and took U as our final SVD embedding, after applying L2 normalization. For further experiments, we
also applied SVD to the weighted sum of the input embedding and output embedding with a weight ratio of
9:1, respectively. We named this the singular value decomposition ratio (SVD-R) embedding
in further experiments.
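
The decomposition and the choice of U as the embedding can be sketched with NumPy's thin SVD. The random
toy matrices are assumptions standing in for the trained input and output embeddings.

```python
import numpy as np

def svd_embedding(c):
    """Return the L2-normalized left singular vectors of c, plus the raw factors."""
    # Thin SVD: u is n x k, s holds the k singular values, vt is k x k.
    u, s, vt = np.linalg.svd(c, full_matrices=False)
    u_norm = u / np.linalg.norm(u, axis=1, keepdims=True)
    return u_norm, (u, s, vt)

rng = np.random.default_rng(0)
emb_in = rng.normal(size=(10, 3))      # toy stand-ins: n = 10 words, k = 3 dimensions
emb_out = rng.normal(size=(10, 3))
c = emb_in + emb_out                   # SUM matrix used as C
emb_svd, (u, s, vt) = svd_embedding(c)
```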


3.5. 2to1

The 2to1 model originates from the 1toN model in [22]. 1toN produces a fine-tuned meta-embedding
that contains knowledge from all individual embedding sets, such as Word2Vec [6], GloVe [29], and Collobert
and Weston (CW) [2]. Unlike the 1toN model, we train the word vectors from two embeddings, the input
embedding and the output embedding. We first randomly initialize the embedding and then train it from the
input embedding and output embedding with the loss function introduced in [35], which efficiently updates
a word embedding matrix with an enormous vocabulary. As shown in Figure 2, the losses loss_in and loss_out
from the embeddings emb_in and emb_out are used to train emb_2to1. This loss function replaces applying the
Softmax function to all values of the output layer.

We use this dual embedding to predict the representations of a word in the individual embedding
sets by projection. We also use a parameter α to find the best combination of the input embedding and the
output embedding, as shown in (4). This makes the resulting vector more meaningful because it learns from
both embeddings.


$$loss_{total} = loss_{in} \times \alpha + loss_{out} \times (1 - \alpha) \tag{4}$$

[Diagram: emb_in -> loss_in and emb_out -> loss_out, both used to train emb_2to1]

Figure 2. Visualization of using loss function in 2to1 model
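
The weighting in (4) can be sketched as below. This is not the paper's implementation: the squared-error
losses and the linear projections proj_in and proj_out are assumptions borrowed from the general 1toN idea,
and the sampled loss of [35] is not reproduced. Names, learning rate, and toy data are hypothetical.

```python
import numpy as np

def train_2to1(emb_in, emb_out, alpha=0.5, epochs=300, lr=0.5, seed=0):
    """Learn one embedding whose projections approximate emb_in and emb_out."""
    rng = np.random.default_rng(seed)
    n, d = emb_in.shape
    emb_2to1 = rng.normal(scale=0.1, size=(n, d))   # randomly initialized embedding
    proj_in = rng.normal(scale=0.1, size=(d, d))    # hypothetical projection to emb_in
    proj_out = rng.normal(scale=0.1, size=(d, d))   # hypothetical projection to emb_out
    history = []
    for _ in range(epochs):
        err_in = emb_2to1 @ proj_in - emb_in
        err_out = emb_2to1 @ proj_out - emb_out
        loss_in = (err_in ** 2).sum() / n
        loss_out = (err_out ** 2).sum() / n
        history.append(float(alpha * loss_in + (1 - alpha) * loss_out))  # equation (4)
        g_in = (2.0 * alpha / n) * err_in
        g_out = (2.0 * (1 - alpha) / n) * err_out
        grad_emb = g_in @ proj_in.T + g_out @ proj_out.T
        grad_pin = emb_2to1.T @ g_in
        grad_pout = emb_2to1.T @ g_out
        emb_2to1 -= lr * grad_emb
        proj_in -= lr * grad_pin
        proj_out -= lr * grad_pout
    return emb_2to1, history

rng = np.random.default_rng(1)
emb_in = rng.normal(size=(40, 4))
emb_out = rng.normal(size=(40, 4))
emb_in /= np.linalg.norm(emb_in, axis=1, keepdims=True)
emb_out /= np.linalg.norm(emb_out, axis=1, keepdims=True)
emb_2to1, history = train_2to1(emb_in, emb_out, alpha=0.5)
```

Setting alpha closer to 1 makes the learned embedding follow the input embedding more closely, and closer
to 0 the output embedding.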


4. RESULTS AND DISCUSSION 

In this paper, we utilized the input embedding and output embedding from the Word2Vec model,
specifically the Skip-gram model, trained on the “One Billion Word Language Modeling Benchmark” dataset,
which consists of almost 1 billion words of already pre-processed text. We set the vocabulary size to
229842, keeping high-frequency words and discarding words that occur rarely. The input embedding and
output embedding are both 300-dimensional vectors.

The proposed dual embeddings are evaluated quantitatively on word analogy and similarity tasks,
and then qualitatively on the nearest neighbors of several words. We combined the input embedding and
output embedding in several ways, obtaining the CONC, SUM, AE-CONC, AE-SUM, AE-SUMR, SVD, SVD-R, and 2to1
embeddings. We compared the performance of these dual embeddings with the individual input embedding and
output embedding.


4.1. Word analogy task 

We used the semantic-syntactic word relationship test set from [6] to measure the quality of our
embeddings. It has 8869 semantic and 10675 syntactic questions; the semantic questions include categories
such as the male-to-female relationship. Each question is a list of four words forming two analogous word
pairs, such as “he” : “she” :: “man” : “woman”. The task is to predict the last word from the other three.
For example, we compute the vector x = vec(“she”) - vec(“he”) + vec(“man”) and find the word whose vector
is closest to x by cosine distance. The closest word must be exactly the last word of the question
(“woman” in the above example) to count as a correct answer when we evaluate accuracy.
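
The procedure can be sketched on a hand-made toy vocabulary as follows. The vectors are invented for
illustration, and excluding the three question words from the candidates is a common convention assumed
here, not something the paper states.

```python
import numpy as np

def solve_analogy(emb, vocab, a, b, c):
    """Return the word closest to vec(b) - vec(a) + vec(c) by cosine similarity."""
    index = {word: i for i, word in enumerate(vocab)}
    query = emb[index[b]] - emb[index[a]] + emb[index[c]]
    sims = (emb @ query) / (np.linalg.norm(emb, axis=1) * np.linalg.norm(query))
    for word in (a, b, c):                 # assumption: question words cannot be answers
        sims[index[word]] = -np.inf
    return vocab[int(np.argmax(sims))]

vocab = ["he", "she", "man", "woman", "child"]
emb = np.array([
    [1.0, 0.0, 0.0],   # he
    [0.0, 1.0, 0.0],   # she
    [1.0, 0.0, 1.0],   # man
    [0.0, 1.0, 1.0],   # woman
    [0.5, 0.5, 1.0],   # child
])
print(solve_analogy(emb, vocab, "he", "she", "man"))   # woman
```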

The performance on the word analogy task is reported in Table 1, divided into semantic accuracy,
syntactic accuracy, and total accuracy on the semantic-syntactic word relationship test set. The first two
rows are the results of the individual input embedding and output embedding. In total accuracy, the input
embedding performs worse than every dual embedding, and the output embedding worse than every dual
embedding except AE-SUM. Combining the output embedding with the input embedding in the ways we proposed
therefore performs better than using only the input embedding. These results in Table 1 support our
hypothesis that it is effective to use both the input and output embeddings.

We found some interesting results in this experiment. In the semantic part, the SVD embedding
performs best. In the syntactic part, however, the CONC embedding outperforms the others. It is
interesting that the CONC and SUM embeddings perform well on the syntactic task simply by concatenating or
adding the input embedding and output embedding. Notably, the 2to1 embedding, produced by the model we
proposed, achieves the best total accuracy on the word analogy task among all embeddings, including the
other dual embeddings. This shows that the 2to1 model has an advantage in solving analogies by forward
propagating both the input embedding and output embedding.


Table 1. Accuracy on word analogy task 


Embeddings   Semantic   Syntactic   Total
input        64.4       66.5        65.6
output       67.0       68.3        67.7
CONC         69.6       69.6        69.4
SUM          79.7       69.4        73.9
AE-CONC      71.6       67.3        69.2
AE-SUM       69.4       65.6        67.3
AE-SUMR      78.8       67.6        72.7
SVD          82.5       65.3        73.1
SVD-R        79.0       67.0        72.4
2to1         81.9       68.1        74.3


4.2. Word similarity task 

We evaluated the embeddings by Spearman rank correlation on the word similarity task. A similarity
score for each word pair is obtained by calculating the cosine similarity of the embedding vectors after
normalizing each feature across the vocabulary. Spearman's rank correlation coefficient is then computed
between these scores and the human judgments. Table 2 reports the coefficient as a percentage. We used the
Rubenstein-Goodenough (RG) dataset [36] with 65 word pairs, the Miller-Charles (MC) dataset [37] with 30
word pairs, the SimLex-999 (SL-999) dataset [38] with 999 word pairs, and the rare word (RW) dataset [39]
with 2034 word pairs in this word similarity task.
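
A small sketch of this evaluation is shown below. The word vectors and human scores are made-up toy
values, the rank-based Spearman helper assumes there are no tied scores, and the per-feature normalization
step is omitted for brevity.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def spearman(x, y):
    """Spearman correlation as the Pearson correlation of ranks (assumes no ties)."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return float(np.corrcoef(rank_x, rank_y)[0, 1])

vectors = {                                # toy 2-dimensional embeddings
    "car": np.array([1.0, 0.0]),
    "automobile": np.array([1.0, 0.1]),
    "road": np.array([1.0, 1.0]),
    "banana": np.array([0.0, 1.0]),
}
pairs = [                                  # (word1, word2, invented human score)
    ("car", "automobile", 9.0),
    ("car", "road", 5.0),
    ("car", "banana", 1.0),
]
model_scores = [cosine(vectors[a], vectors[b]) for a, b, _ in pairs]
human_scores = [score for _, _, score in pairs]
rho = spearman(model_scores, human_scores)
```

In this toy case the model ranks the pairs in the same order as the invented human scores, so the
correlation is perfect; real datasets contain ties, which call for average ranks instead.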

We ran the word similarity task on the individual input embedding and output embedding and on the
dual embeddings CONC, SUM, AE-CONC, AE-SUM, AE-SUMR, SVD, SVD-R, and 2to1, as shown in Table 2. Overall,
the autoencoder embeddings, especially the autoencoder with the SUM embedding (AE-SUM), outperform the
other dual embeddings on the MC and RG datasets. These two datasets have few word pairs compared to the
SL-999 and RW datasets, and AE-SUM performs well on them because it contains the information of the input
embedding and output embedding in a smaller dimensionality.


Table 2. Spearman rank correlation coefficient for word similarity task


Embeddings   MC      RG      SL-999   RW
input        70.12   72.38   48.52    35.64
output       68.63   64.10   44.20    45.72
CONC         68.59   64.37   44.81    46.03
SUM          72.66   70.23   44.89    46.55
AE-CONC      78.58   78.59   44.53    43.76
AE-SUM       79.78   79.01   38.47    41.54
AE-SUMR      76.84   77.33   44.18    43.89
SVD          74.93   76.42   45.25    49.50
SVD-R        76.56   78.58   48.54    41.88
2to1         74.02   71.37   43.44    47.82


On the SimLex-999 dataset, the SVD embedding with a ratio of 9:1 (SVD-R) clearly outperforms the
other dual embeddings and the output embedding, and is marginally better than the input embedding. On the
RW dataset, the SVD and 2to1 embeddings perform better than both the input and output embeddings, and
even better than simply adding the input embedding and output embedding. Since the RW dataset consists of
rare words, this suggests that the SVD embedding is good at capturing the features of rare words.


4.3. Nearest neighbors

We selected several words and their nearest neighbors to show qualitative results for the input
embedding, the output embedding, and our dual embeddings. We ran this experiment on only three dual
embeddings, chosen as representatives. In Table 3, the AE (AutoEncoder) rows show the AE-SUMR embedding as
a representative of all autoencoder embeddings.

The words in Table 3 are ‘language’, ‘eminem’, ‘unflagging’, ‘remonstrate’, and ‘reprobate’: two
frequent, well-known words (‘language’, ‘eminem’) and three rare words (‘unflagging’, ‘remonstrate’,
‘reprobate’) that are hard to represent with an embedding. For the two frequent words, all embeddings
(input, output, autoencoder, SVD, and 2to1) return nearest neighbors related to the query word.


Table 3. Nearest neighbors with several words on dual embeddings 


Embedding   language     eminem       unflagging      remonstrate     reprobate
input       languages    rapper       instilled       yelled          shortcake
            vocabulary   kanye        marvles         bargate         arand
            english      coldplay     unwavering      heurelho        sacramen
            dialect      rap          urbanity        kraig           guzzles
            phrases      rappers      rediscovering   reasoning       loudmouth
output      phonetic     outkast      unquenchable    remonstrated    7david
            aramaic      soulja       unswerving      remonstrating   turnblad
            dialects     timbaland    pasquerilla     rangana         guidenstern
            idioms       tinchy       untiring        olimpico        claireece
            dialect      jeezy        unstaining      skomina         deerhound
AE          languages    rapper       unwavering      ovrebo          philanderer
            arabic       interscope   tenacity        jeered          hissy
            english      album        dedication      linesman        druggie
            dialect      timbaland    unfailing       whistled        alcoholic
            fluent       grammy       unswerving      referee         loveable
SVD         afrikaans    rapper       unwavering      remonstrating   alcoholic
            dialect      dre          unswerving      remonstrated    etonian
            fluent       interscope   unstinting      liaise          codger
            pashto       ludacris     unfailing       berserk         forma
            dialects     rihanna      unquenchable    unheeded        curmudgeon
2to1        english      rapper       unwavering      linesman        loudmouth
            languages    rap          dedication      remonstrated    druggie
            arabic       rappers      devotion        jeered          dim-witted
            vocabulary   album        tenacity        ovrebo          unlikeable
            the          kanye        unswerving      shouting        mutilates


The rare word ‘unflagging’ means never becoming weaker, ‘remonstrate’ means to complain, and
‘reprobate’ means a person of bad character and habits. For each rare word, the nearest neighbors under
the input embedding and the output embedding include many words with unrelated meanings. With our dual
embeddings (AutoEncoder, SVD, and 2to1), however, the nearest neighbors are more often related words with
similar meanings.


5. CONCLUSION 

We found a way to better represent words as distributed vectors by using both the input embedding
and the output embedding obtained from training Word2Vec. Unlike other works, we used embeddings from only
one model, Word2Vec, by simply combining its input embedding and output embedding. Notably, we made use of
the output embedding, which the Word2Vec model normally discards. Recent works such as BERT represent
words very well, but our method is a simple and fast way of representing words. We demonstrated its
effectiveness on the word analogy task, the word similarity task, and the nearest neighbors of the dual
embeddings. By proposing several dual embeddings, namely CONC, SUM, AE, SVD, and 2to1, we found various
ways to represent words. We leave applying these methods to other models to future work. State-of-the-art
word embedding models also have input and output embeddings when they are trained, so combining those two
embeddings may improve on the performance of the original models.